Dusan Birtasevic
Kavian Mashayekhi
Narjes Amusoltani
Tina Khazaee
This computer vision project introduces a method for generating facial 3D images/clips from 2D images, with a focus on enhancing facial details and creating captivating 3D animations.
The pipeline combines YOLOv8 for face detection, an Efficient Sub-Pixel Convolutional Neural Network (CNN) for Image Super-Resolution, Real ESRGAN for realistic enhancement, and the First Order Motion Model for animation synthesis.
Initially, YOLOv8 accurately localizes and extracts facial regions from input 2D images. These faces are then cropped and passed to the Efficient Sub-Pixel CNN, a powerful network that reconstructs high-resolution facial images, significantly improving facial quality and detail.
Input Image:
Prediction (yolov8_face_detection Notebook):
Bounding Box Crop (yolov8_face_detection Notebook):
To further enhance the upscaled images, Real ESRGAN, a specialized Generative Adversarial Network (GAN) for super-resolution tasks, is utilized to generate visually realistic and finely-detailed facial representations.
Lastly, the First Order Motion Model breathes life into the enhanced facial images, allowing for the creation of dynamic 3D animations. The model transfers facial movements from source videos to the improved 3D facial representations, resulting in realistic and captivating visualizations.
Enhanced Image (ESRGAN_Image_Enhancement Notebook):
Animate 2d Picture (First Order Motion Notebook):
Enhance Video (ESRGAN_Video_Enhancement Notebook):
Our primary motivation was to enhance facial details in 2D images and produce compelling 3D animations. Potential applications range from aiding suspect identification for law enforcement to enhancing low-quality security-camera footage of suspects.
Through comprehensive experimentation and evaluation, our approach demonstrates substantial improvements in facial details, animation realism, and overall visual appeal. This project contributes to the field of computer vision by opening up possibilities for advancements in facial enhancement and animation synthesis, with promising applications in security, entertainment, and various other domains.
Our project focuses on the fascinating task of transforming 2D images into realistic and dynamic 3D representations. The process of converting flat images into immersive 3D scenes presents a challenging problem due to the absence of depth information in 2D format. To address this issue, we aim to develop an efficient and user-friendly solution that automates the generation of 3D images and clips, making it accessible to a wide range of users.
Creating 3D content traditionally involves labor-intensive manual processes and specialized software. This limits its widespread adoption and inhibits its potential impact across various industries. Our project seeks to overcome the barriers associated with 3D content creation by developing an automated approach that requires minimal user intervention, thus democratizing the accessibility of 3D visuals.
The ability to produce 3D images and clips from 2D sources holds great significance for industries like entertainment, education, design, and marketing. Enabling a broader audience, including non-experts, to generate 3D content can lead to the proliferation of more engaging and interactive media. Moreover, by reducing the time and skill requirements, our solution can empower creative professionals and businesses to enhance their visual communication and storytelling capabilities.
Through our research and development, we have devised an innovative algorithm that utilizes advanced computer vision and deep learning techniques. Our algorithm achieves impressive results by accurately inferring depth information from 2D images and translating it into compelling 3D renditions. The generated 3D images and clips exhibit a convincing level of realism and immersion, mirroring the characteristics of manually crafted 3D content.
To build this project, we adapted several existing projects, tailoring and fine-tuning them to achieve the best possible result.
First, the idea behind this project comes from MyHeritage Deep Nostalgia; this project mimics that functionality.
Second, we used the First Order Motion Model for Image Animation as our core model for creating 3D images/clips from 2D inputs.
In addition, the paper Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data and its linked notebook were used and fine-tuned in our work.
We also used Image Super-Resolution using an Efficient Sub-Pixel CNN to enhance the detail of our input images before passing them to the 2D-to-3D model.
For YOLO, a pre-trained weights model was used. Model location: https://github.com/akanametov/yolov8-face
For the image enhancement part, BSDS500 (Berkeley Segmentation Dataset 500) was used. This dataset is designed for evaluating natural edge detection, covering not only object contours but also object-interior boundaries and background boundaries. It includes 500 natural images with carefully annotated boundaries collected from multiple users.
The well-organized structure of this dataset allowed us to retrieve the required data directly from this source.
To fine-tune the model trained in the "Image Enhancement" notebook, we used the following dataset from Kaggle:
https://www.kaggle.com/datasets/sanidhyak/human-face-emotions
This dataset contains over 250 Sad, Angry, and Happy face images. All categories were merged and then split into train, test, and validation sets for fine-tuning the pre-trained model.
The provided code implements face detection using YOLOv8. The pre-trained YOLOv8 face detection model is loaded, and an image is uploaded for processing. The model predicts faces' bounding box coordinates, which are then visualized on the original image. Additionally, the code crops and saves individual face images based on the detected bounding boxes. This methodology enables efficient face detection and provides the ability to extract and save individual face images for further analysis or processing.
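As an illustration, a minimal sketch of this detect-and-crop step is shown below; the weights file name yolov8n-face.pt and the image path are placeholders standing in for the weights from the linked repository and the uploaded image.

from ultralytics import YOLO
from PIL import Image

# Load the pre-trained YOLOv8 face-detection weights (placeholder file name)
model = YOLO("yolov8n-face.pt")

# Run detection on an uploaded image (placeholder path)
image_path = "input.jpg"
results = model(image_path)

# Crop and save each face found inside a predicted bounding box
img = Image.open(image_path)
for i, box in enumerate(results[0].boxes.xyxy.tolist()):
    x1, y1, x2, y2 = map(int, box)
    face = img.crop((x1, y1, x2, y2))
    face.save(f"face_{i}.jpg")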
The other step before feeding the image into the 3D-generation model was to enhance the image's detail and resolution. The reason for this step is that the generator's performance and output quality depend directly on the quality of the input image.
So we perform this step after detection with YOLO and before feeding the image to the main model, in order to obtain a better output from it.
Real-ESRGAN is a model designed for upscaling and enhancing real-world images with state-of-the-art performance. It builds on the principles of Generative Adversarial Networks (GANs) and Efficient Sub-pixel Convolutional Neural Networks (ESPCN), and leverages a combination of techniques to achieve its results. Here's how it works:
Architecture: Real-ESRGAN combines a generator network and a discriminator network. The generator network is tasked with upsampling an input image, while the discriminator network tries to distinguish the upscaled images from real high-resolution ones. This adversarial process pushes the generator to produce increasingly convincing high-resolution images.
ESPCN (Efficient Sub-Pixel Convolutional Neural Network): ESPCN focuses on increasing spatial resolution efficiently. By using sub-pixel convolution, the network rearranges elements from the channel dimension into the spatial high-resolution domain. It's a computationally efficient way of achieving higher resolution without resorting to heavy transpose convolutions. Real-ESRGAN also applies sinc filters to synthesize ringing and overshoot artifacts when constructing the training pairs.
The 2D sinc filter used for this is

$$ k(i, j) = \frac{\omega_c}{2\pi\sqrt{i^2 + j^2}}\, J_1\!\left(\omega_c \sqrt{i^2 + j^2}\right), $$

where $(i, j)$ is the kernel coordinate, $\omega_c$ is the cutoff frequency, and $J_1$ is the first-order Bessel function of the first kind.
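To make this formula concrete, here is a small NumPy sketch of how such a sinc kernel could be built; it is an illustration only, and the actual Real-ESRGAN implementation differs in details such as kernel-size sampling and padding.

import numpy as np
from scipy.special import j1  # first-order Bessel function of the first kind

def sinc_kernel(cutoff, kernel_size=21):
    # Build a 2D sinc (ideal low-pass) kernel with the given cutoff frequency
    center = kernel_size // 2
    kernel = np.zeros((kernel_size, kernel_size))
    for i in range(kernel_size):
        for j in range(kernel_size):
            r = np.sqrt((i - center) ** 2 + (j - center) ** 2)
            if r == 0:
                kernel[i, j] = cutoff ** 2 / (4 * np.pi)  # limit of the formula as r -> 0
            else:
                kernel[i, j] = cutoff / (2 * np.pi * r) * j1(cutoff * r)
    return kernel / kernel.sum()  # normalize so the kernel sums to 1

kernel = sinc_kernel(cutoff=np.pi / 3)
print(kernel.shape, round(kernel.sum(), 6))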
Pre-trained Models and Fine-tuning: Real-ESRGAN usually leverages pre-trained models and fine-tunes them on specific tasks or domains. This transfer learning approach significantly reduces the training time and resources needed.
Loss Functions: It uses several loss functions to train the generator, including a pixel-wise L1 loss, a perceptual loss, and an adversarial (GAN) loss.
Real-world Degradation Modeling: One innovation in Real-ESRGAN is its focus on real-world image degradation instead of synthetic degradations (e.g., downscaling with simple filters). The training dataset incorporates real-world low-quality images, allowing the network to learn the complex and various degradations present in real scenarios.
Results: The final result is an enhanced and upscaled image that not only has higher pixel resolution but also exhibits improvements in sharpness, detail, and overall visual quality.
Real-ESRGAN represents a combination of several advanced techniques in machine learning and computer vision. It offers significant improvements over earlier super-resolution models, especially when dealing with real-world, non-ideal, and varied images. Its adaptability and performance make it valuable in various applications, ranging from restoring old films to enhancing satellite imagery.
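For reference, running Real-ESRGAN inference with the authors' realesrgan package typically looks like the sketch below; the paths, weight file, and constructor arguments are assumptions based on the repository's example script rather than our exact notebook code.

import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# RRDB generator backbone matching the RealESRGAN_x4plus weights (assumed configuration)
backbone = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                   num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(scale=4, model_path="RealESRGAN_x4plus.pth",
                         model=backbone, tile=0, half=False)

# Enhance a cropped face image and save the 4x upscaled result
img = cv2.imread("face_crop.jpg", cv2.IMREAD_COLOR)
enhanced, _ = upsampler.enhance(img, outscale=4)
cv2.imwrite("face_crop_enhanced.png", enhanced)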
The First Order Motion Model is a deep learning approach designed to animate a given image using the motion extracted from a driving video. It's widely used for applications like facial animation, video editing, and avatar creation. Here's how the First Order Motion Model works:
Input: The model takes two primary inputs: a source image (the still image to be animated) and a driving video (which supplies the motion).
Motion Extraction: The driving video is fed into the model to extract motion information. This is usually done by employing a set of keypoints that represent the essential parts of the image (such as facial landmarks if the image is a face).
Keypoint Detector: A specific module called a keypoint detector is used to find keypoints in both the source image and the driving video. This helps in understanding the structure and motion between the two.
Keypoint Descriptor: Besides the keypoints, a local region around each keypoint, referred to as a descriptor, is extracted. This helps in understanding the texture and appearance around the keypoints.
Motion Representation: The motion between the source image and the driving video is represented as a sparse set of keypoints and dense local changes in the appearance around the keypoints. The model calculates the difference in keypoints and descriptors between the source image and each frame of the driving video. Formally, the local affine transformation of the $k$-th region from the reference frame $R$ to the image $X$ is represented by a matrix $$ A^k_{X \leftarrow R} \in \mathbb{R}^{2 \times 3}. $$
Loss Functions: The training process uses several loss functions, most notably a perceptual reconstruction loss between the generated frames and the driving frames, and equivariance constraints that keep the detected keypoints consistent under known image transformations.
Result: The output is a sequence of images or a video where the source image is animated according to the motion extracted from the driving video.
What's unique about the First Order Motion Model is that it doesn't require paired training data, meaning you don't need examples of the source image and the corresponding animated sequence during training. This makes the model highly flexible and applicable to a wide variety of images and motions.
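In practice, applying the reference implementation comes down to loading a config/checkpoint pair and calling its animation routine. The sketch below follows the repository's demo script; the config, checkpoint, and media paths are placeholders.

import imageio
from skimage.transform import resize
from demo import load_checkpoints, make_animation  # from the first-order-model repository

# Load the generator and keypoint detector from a published checkpoint (placeholder paths)
generator, kp_detector = load_checkpoints(config_path="config/vox-256.yaml",
                                          checkpoint_path="vox-cpk.pth.tar")

# Source image to animate and driving video that supplies the motion
source_image = resize(imageio.imread("face_crop_enhanced.png"), (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3]
                 for frame in imageio.mimread("driving.mp4", memtest=False)]

# Transfer the driving video's motion onto the still source image
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
imageio.mimsave("animated.mp4", [(frame * 255).astype("uint8") for frame in predictions])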
In this experiment, we focused on implementing YOLOv8 for face detection as part of our 2D to 3D model pipeline. We obtained the YOLOv8 face detection model from a public GitHub repository (https://github.com/akanametov/yolov8-face) and used it to perform face detection on uploaded images. The model was loaded and used to predict bounding box coordinates for detected faces, which were then visualized on the original images. Additionally, we extracted and saved individual face images based on the detected bounding boxes.
This allowed us to efficiently detect faces in input images and obtain higher-resolution face crops for further processing. The YOLOv8 model's real-time capabilities and state-of-the-art performance made it a valuable component in our 2D to 3D model, enabling accurate and detailed face detection, ultimately improving the quality and resolution of the input images used in the 2D to 3D generation process. The results of this experiment were promising and contributed significantly to the overall success of our 2D to 3D model.
As discussed earlier, in this step we implemented a CNN model to increase the resolution and image quality of our input image before feeding it to the model for 3D generation.
The training of this model is based on a notebook for enhancing picture detail, which can be found here: Image Super-Resolution using an Efficient Sub-Pixel CNN.
We implemented this Efficient Sub-Pixel CNN to increase the detail of the 2D images used as input to our 2D-to-3D model.
Here is the link to the complete notebook: LINK
Below, we will discuss the main part of the implementation.
One of the most important parts of this training was to prepare the dataset so that we had low-resolution images on one hand and the original high-quality images on the other. This pairing is what makes the model trainable, and it also gave us a clear metric for evaluating the model's performance.
So, for pre-processing, we first change the color space from RGB to YUV.
For the input data (low-resolution images), we crop the image, retrieve the Y channel (luminance), and resize it with the area method (use BICUBIC if you use PIL). We only consider the luminance channel in the YUV color space because humans are more sensitive to luminance changes.
For the target data (high-resolution images), we just crop the image and retrieve the Y channel.
# Use TF Ops to process:
def process_input(input, input_size, upscale_factor):
    input = tf.image.rgb_to_yuv(input)
    last_dimension_axis = len(input.shape) - 1
    y, u, v = tf.split(input, 3, axis=last_dimension_axis)
    return tf.image.resize(y, [input_size, input_size], method="area")


def process_target(input):
    input = tf.image.rgb_to_yuv(input)
    last_dimension_axis = len(input.shape) - 1
    y, u, v = tf.split(input, 3, axis=last_dimension_axis)
    return y


train_ds = train_ds.map(
    lambda x: (process_input(x, input_size, upscale_factor), process_target(x))
)
train_ds = train_ds.prefetch(buffer_size=32)

valid_ds = valid_ds.map(
    lambda x: (process_input(x, input_size, upscale_factor), process_target(x))
)
valid_ds = valid_ds.prefetch(buffer_size=32)
Using the above pre-processing resulted in the following low-resolution images, shown alongside the originals.
for batch in train_ds.take(1):
    for img in batch[0]:
        display(array_to_img(img))
    for img in batch[1]:
        display(array_to_img(img))
Then we defined a model as below:
def get_model(upscale_factor=3, channels=1):
    conv_args = {
        "activation": "relu",
        "kernel_initializer": "Orthogonal",
        "padding": "same",
    }
    inputs = keras.Input(shape=(None, None, channels))
    x = layers.Conv2D(64, 5, **conv_args)(inputs)
    x = layers.Conv2D(64, 3, **conv_args)(x)
    x = layers.Conv2D(32, 3, **conv_args)(x)
    x = layers.Conv2D(channels * (upscale_factor ** 2), 3, **conv_args)(x)
    outputs = tf.nn.depth_to_space(x, upscale_factor)
    return keras.Model(inputs, outputs)
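As a quick sanity check of the sub-pixel upscaling, a dummy tensor can be pushed through the model; with an assumed upscale factor of 3, the spatial dimensions triple while the channel count returns to 1.

import numpy as np

model = get_model(upscale_factor=3, channels=1)
dummy = np.zeros((1, 100, 100, 1), dtype="float32")
# depth_to_space rearranges the 9 output channels into a 3x larger single-channel image
print(model.predict(dummy).shape)  # (1, 300, 300, 1)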
The training setup and the resulting summary of our model are as follows:
early_stopping_callback = keras.callbacks.EarlyStopping(monitor="loss", patience=10)

checkpoint_filepath = "/tmp/checkpoint"

model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor="loss",
    mode="min",
    save_best_only=True,
)

model = get_model(upscale_factor=upscale_factor, channels=1)
model.summary()

callbacks = [ESPCNCallback(), early_stopping_callback, model_checkpoint_callback]
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.Adam(learning_rate=0.001)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, None, None, 1)] 0
conv2d (Conv2D) (None, None, None, 64) 1664
conv2d_1 (Conv2D) (None, None, None, 64) 36928
conv2d_2 (Conv2D) (None, None, None, 32) 18464
conv2d_3 (Conv2D) (None, None, None, 9) 2601
tf.nn.depth_to_space (TFOp (None, None, None, 1) 0
Lambda)
=================================================================
Total params: 59657 (233.04 KB)
Trainable params: 59657 (233.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Using some utility functions defined in the reference for this work, we were able to track the fitting process step by step and observe the model's performance every 20 epochs.
epochs = 100

model.compile(
    optimizer=optimizer, loss=loss_fn,
)

model.fit(
    train_ds, epochs=epochs, callbacks=callbacks, validation_data=valid_ds, verbose=2
)

# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
Epoch 1/100 Mean PSNR for epoch: 21.69 1/1 [==============================] - 0s 102ms/step
50/50 - 17s - loss: 0.0298 - val_loss: 0.0068 - 17s/epoch - 344ms/step Epoch 2/100 Mean PSNR for epoch: 24.64 50/50 - 17s - loss: 0.0051 - val_loss: 0.0033 - 17s/epoch - 348ms/step Epoch 3/100 Mean PSNR for epoch: 25.54 50/50 - 17s - loss: 0.0036 - val_loss: 0.0029 - 17s/epoch - 334ms/step Epoch 4/100 Mean PSNR for epoch: 26.23 50/50 - 17s - loss: 0.0031 - val_loss: 0.0027 - 17s/epoch - 332ms/step Epoch 5/100 Mean PSNR for epoch: 25.97 50/50 - 16s - loss: 0.0030 - val_loss: 0.0026 - 16s/epoch - 328ms/step Epoch 6/100 Mean PSNR for epoch: 26.20 50/50 - 17s - loss: 0.0029 - val_loss: 0.0025 - 17s/epoch - 332ms/step Epoch 7/100 Mean PSNR for epoch: 26.24 50/50 - 17s - loss: 0.0028 - val_loss: 0.0025 - 17s/epoch - 334ms/step Epoch 8/100 Mean PSNR for epoch: 26.18 50/50 - 16s - loss: 0.0028 - val_loss: 0.0025 - 16s/epoch - 329ms/step Epoch 9/100 Mean PSNR for epoch: 26.35 50/50 - 17s - loss: 0.0029 - val_loss: 0.0025 - 17s/epoch - 336ms/step Epoch 10/100 Mean PSNR for epoch: 26.20 50/50 - 17s - loss: 0.0028 - val_loss: 0.0024 - 17s/epoch - 345ms/step Epoch 11/100 Mean PSNR for epoch: 26.32 50/50 - 22s - loss: 0.0027 - val_loss: 0.0024 - 22s/epoch - 448ms/step Epoch 12/100 Mean PSNR for epoch: 26.05 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 358ms/step Epoch 13/100 Mean PSNR for epoch: 26.21 50/50 - 18s - loss: 0.0028 - val_loss: 0.0024 - 18s/epoch - 359ms/step Epoch 14/100 Mean PSNR for epoch: 26.40 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 356ms/step Epoch 15/100 Mean PSNR for epoch: 26.47 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 357ms/step Epoch 16/100 Mean PSNR for epoch: 26.41 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 352ms/step Epoch 17/100 Mean PSNR for epoch: 25.82 50/50 - 18s - loss: 0.0029 - val_loss: 0.0024 - 18s/epoch - 353ms/step Epoch 18/100 Mean PSNR for epoch: 26.41 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 351ms/step Epoch 19/100 Mean PSNR for epoch: 26.77 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 352ms/step Epoch 20/100 Mean PSNR for epoch: 26.58 50/50 - 23s - loss: 0.0027 - val_loss: 0.0024 - 23s/epoch - 467ms/step Epoch 21/100 Mean PSNR for epoch: 26.52 1/1 [==============================] - 0s 57ms/step
50/50 - 22s - loss: 0.0026 - val_loss: 0.0024 - 22s/epoch - 434ms/step Epoch 22/100 Mean PSNR for epoch: 26.22 50/50 - 23s - loss: 0.0026 - val_loss: 0.0023 - 23s/epoch - 460ms/step Epoch 23/100 Mean PSNR for epoch: 26.60 50/50 - 21s - loss: 0.0027 - val_loss: 0.0023 - 21s/epoch - 422ms/step Epoch 24/100 Mean PSNR for epoch: 26.56 50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 353ms/step Epoch 25/100 Mean PSNR for epoch: 27.01 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step Epoch 26/100 Mean PSNR for epoch: 26.30 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step Epoch 27/100 Mean PSNR for epoch: 27.18 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step Epoch 28/100 Mean PSNR for epoch: 26.40 50/50 - 18s - loss: 0.0026 - val_loss: 0.0024 - 18s/epoch - 352ms/step Epoch 29/100 Mean PSNR for epoch: 26.63 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 352ms/step Epoch 30/100 Mean PSNR for epoch: 26.43 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step Epoch 31/100 Mean PSNR for epoch: 26.01 50/50 - 17s - loss: 0.0028 - val_loss: 0.0024 - 17s/epoch - 346ms/step Epoch 32/100 Mean PSNR for epoch: 26.44 50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 33/100 Mean PSNR for epoch: 26.89 50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 34/100 Mean PSNR for epoch: 26.50 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 348ms/step Epoch 35/100 Mean PSNR for epoch: 26.69 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 351ms/step Epoch 36/100 Mean PSNR for epoch: 26.83 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 37/100 Mean PSNR for epoch: 26.58 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 38/100 Mean PSNR for epoch: 26.76 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 347ms/step Epoch 39/100 Mean PSNR for epoch: 26.65 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 347ms/step Epoch 40/100 Mean PSNR for epoch: 26.40 50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 357ms/step Epoch 41/100 Mean PSNR for epoch: 26.45 1/1 [==============================] - 0s 47ms/step
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 364ms/step Epoch 42/100 Mean PSNR for epoch: 26.68 50/50 - 19s - loss: 0.0026 - val_loss: 0.0023 - 19s/epoch - 381ms/step Epoch 43/100 Mean PSNR for epoch: 26.80 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 44/100 Mean PSNR for epoch: 26.84 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 45/100 Mean PSNR for epoch: 26.48 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 46/100 Mean PSNR for epoch: 26.26 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 47/100 Mean PSNR for epoch: 26.58 50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 350ms/step Epoch 48/100 Mean PSNR for epoch: 26.32 50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 49/100 Mean PSNR for epoch: 26.54 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 50/100 Mean PSNR for epoch: 26.42 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 347ms/step Epoch 51/100 Mean PSNR for epoch: 26.67 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 347ms/step Epoch 52/100 Mean PSNR for epoch: 26.45 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 348ms/step Epoch 53/100 Mean PSNR for epoch: 26.91 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 345ms/step Epoch 54/100 Mean PSNR for epoch: 26.56 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 346ms/step Epoch 55/100 Mean PSNR for epoch: 26.91 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 56/100 Mean PSNR for epoch: 26.81 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step Epoch 57/100 Mean PSNR for epoch: 26.70 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step Epoch 58/100 Mean PSNR for epoch: 26.45 50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 354ms/step Epoch 59/100 Mean PSNR for epoch: 26.82 50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 350ms/step Epoch 60/100 Mean PSNR for epoch: 26.45 50/50 - 17s - loss: 0.0027 - val_loss: 0.0022 - 17s/epoch - 347ms/step Epoch 61/100 Mean PSNR for epoch: 26.75 1/1 [==============================] - 0s 46ms/step
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 368ms/step Epoch 62/100 Mean PSNR for epoch: 26.27 50/50 - 19s - loss: 0.0025 - val_loss: 0.0022 - 19s/epoch - 384ms/step Epoch 63/100 Mean PSNR for epoch: 26.37 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step Epoch 64/100 Mean PSNR for epoch: 27.33 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 349ms/step Epoch 65/100 Mean PSNR for epoch: 26.89 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step Epoch 66/100 Mean PSNR for epoch: 25.87 50/50 - 17s - loss: 0.0027 - val_loss: 0.0026 - 17s/epoch - 346ms/step Epoch 67/100 Mean PSNR for epoch: 26.67 50/50 - 18s - loss: 0.0026 - val_loss: 0.0022 - 18s/epoch - 356ms/step Epoch 68/100 Mean PSNR for epoch: 26.70 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 69/100 Mean PSNR for epoch: 26.46 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step Epoch 70/100 Mean PSNR for epoch: 26.71 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 346ms/step Epoch 71/100 Mean PSNR for epoch: 26.62 50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 343ms/step Epoch 72/100 Mean PSNR for epoch: 26.90 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 73/100 Mean PSNR for epoch: 26.73 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step Epoch 74/100 Mean PSNR for epoch: 26.54 50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 354ms/step Epoch 75/100 Mean PSNR for epoch: 26.88 50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 351ms/step Epoch 76/100 Mean PSNR for epoch: 26.55 50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 351ms/step Epoch 77/100 Mean PSNR for epoch: 24.51 50/50 - 17s - loss: 0.0028 - val_loss: 0.0036 - 17s/epoch - 344ms/step Epoch 78/100 Mean PSNR for epoch: 26.89 50/50 - 17s - loss: 0.0029 - val_loss: 0.0022 - 17s/epoch - 346ms/step Epoch 79/100 Mean PSNR for epoch: 26.93 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step Epoch 80/100 Mean PSNR for epoch: 27.01 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 81/100 Mean PSNR for epoch: 26.90 1/1 [==============================] - 0s 47ms/step
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 367ms/step Epoch 82/100 Mean PSNR for epoch: 26.66 50/50 - 19s - loss: 0.0025 - val_loss: 0.0022 - 19s/epoch - 382ms/step Epoch 83/100 Mean PSNR for epoch: 26.85 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 342ms/step Epoch 84/100 Mean PSNR for epoch: 26.70 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 85/100 Mean PSNR for epoch: 26.85 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step Epoch 86/100 Mean PSNR for epoch: 26.18 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 87/100 Mean PSNR for epoch: 26.51 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 88/100 Mean PSNR for epoch: 26.40 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step Epoch 89/100 Mean PSNR for epoch: 26.58 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 341ms/step Epoch 90/100 Mean PSNR for epoch: 26.39 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step Epoch 91/100 Mean PSNR for epoch: 26.36 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step Epoch 92/100 Mean PSNR for epoch: 26.84 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 350ms/step Epoch 93/100 Mean PSNR for epoch: 26.81 50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 351ms/step Epoch 94/100 Mean PSNR for epoch: 26.65 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step Epoch 95/100 Mean PSNR for epoch: 26.09 50/50 - 17s - loss: 0.0025 - val_loss: 0.0024 - 17s/epoch - 343ms/step Epoch 96/100 Mean PSNR for epoch: 26.47 50/50 - 17s - loss: 0.0027 - val_loss: 0.0022 - 17s/epoch - 345ms/step Epoch 97/100 Mean PSNR for epoch: 26.39 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 346ms/step Epoch 98/100 Mean PSNR for epoch: 26.33 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 343ms/step Epoch 99/100 Mean PSNR for epoch: 26.58 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step Epoch 100/100 Mean PSNR for epoch: 27.09 50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
As mentioned earlier, to fine-tune the model trained in the "Image Enhancement" notebook, we used the following dataset from Kaggle:
https://www.kaggle.com/datasets/sanidhyak/human-face-emotions
Despite all of our efforts, which can be seen below, we were not able to feed pictures to the pre-trained model because it required a specific kind of input. Even when we reused the model's own pre-processing method, we still could not feed the pictures to the model.
Finally, we moved on and decided to use the .h5 pre-trained model directly as the image enhancement method.
The main related notebook can be found here: LINK
A brief explanation of the work done for this part is given below:
Matching the pre-processing used for our main super-resolution model, we degraded the images by downscaling them with a factor of 0.5. Since the dataset contained JPEG, PNG, and GIF files, we made sure the downscaling covered all of these formats.
import os
from PIL import Image

input_dir = "/Users/kavian/Desktop/data/high_resolution_images"
output_dir = "/Users/kavian/Desktop/data/low_resolution_images"
scale_factor = 0.5  # Adjust this to set the desired low-resolution scale factor

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Downscale every JPEG, PNG, and GIF image and save the low-resolution copy
for filename in os.listdir(input_dir):
    if filename.endswith(".jpg") or filename.endswith(".png") or filename.endswith(".gif"):
        img = Image.open(os.path.join(input_dir, filename))
        low_res_img = img.resize(
            (int(img.width * scale_factor), int(img.height * scale_factor)), Image.LANCZOS
        )
        low_res_img.save(os.path.join(output_dir, filename))
Then, after splitting the dataset into train, test, and validation sets, we imported the model trained in the previous step. We left all of its layers trainable in order to obtain a better fine-tuned model; the main reason we could allow this is that the model is not very deep or complicated, so it was feasible to re-train all of the layers even on a CPU (a sketch of this recompilation step is shown after the model summary below).
model = tf.keras.models.load_model('/Users/kavian/Desktop/GBC/Second Semester/6- DL II Math/Final Project/DLIIMathProject/Notebooks/Image Enhancement/Image_Enhancement_before_finetuning.h5')
print(model.summary())
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, None, None, 1)] 0
conv2d (Conv2D) (None, None, None, 64) 1664
conv2d_1 (Conv2D) (None, None, None, 64) 36928
conv2d_2 (Conv2D) (None, None, None, 32) 18464
conv2d_3 (Conv2D) (None, None, None, 9) 2601
tf.nn.depth_to_space (TFOp (None, None, None, 1) 0
Lambda)
=================================================================
Total params: 59657 (233.04 KB)
Trainable params: 59657 (233.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
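Before fitting on the new dataset, every layer of the loaded model was kept trainable and the model was recompiled. A minimal sketch of that step is below; the learning rate shown is an illustrative fine-tuning value, not necessarily the one we used.

from tensorflow import keras

# Keep every layer trainable -- the network is small enough to re-train fully, even on CPU
for layer in model.layers:
    layer.trainable = True

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # smaller learning rate for fine-tuning
    loss=keras.losses.MeanSquaredError(),
)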
Finally, we tried to fit this model on our new dataset but we encountered the following error again and again:
# Train the model on the new dataset
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=num_epochs,
)
Epoch 1/10
--------------------------------------------------------------------------- UnimplementedError Traceback (most recent call last) Cell In[16], line 2 1 # Train the model on the new dataset ----> 2 model.fit( 3 train_generator, 4 steps_per_epoch=train_generator.samples // batch_size, 5 validation_data=validation_generator, 6 validation_steps=validation_generator.samples // batch_size, 7 epochs=num_epochs 8 ) File ~/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs) 67 filtered_tb = _process_traceback_frames(e.__traceback__) 68 # To get the full stack trace, call: 69 # `tf.debugging.disable_traceback_filtering()` ---> 70 raise e.with_traceback(filtered_tb) from None 71 finally: 72 del filtered_tb File ~/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name) 51 try: 52 ctx.ensure_initialized() ---> 53 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, 54 inputs, attrs, num_outputs) 55 except core._NotOkStatusException as e: 56 if name is not None: UnimplementedError: Graph execution error: Detected at node 'model/conv2d/Relu' defined at (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module> app.launch_new_instance() File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/traitlets/config/application.py", line 1043, in launch_instance app.start() File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 725, in start self.io_loop.start() File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start self.asyncio_loop.run_forever() File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 604, in run_forever self._run_once() File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue await self.process_one() File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 502, in process_one await dispatch(*args) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell await result File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 729, in execute_request reply_content = await reply_content File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 422, in do_execute res = shell.run_cell( File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 540, in run_cell return super().run_cell(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3009, 
in run_cell result = self._run_cell( File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell result = runner(coro) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner coro.send(None) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async has_raised = await self.run_ast_nodes(code_ast.body, cell_name, File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes if await self.run_code(code, result, async_=asy): File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "/var/folders/qy/q7v66x7544q5l2_cnqm0_8y00000gn/T/ipykernel_6556/1510933377.py", line 2, in <module> model.fit( File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1742, in fit tmp_logs = self.train_function(iterator) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1338, in train_function return step_function(self, iterator) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1322, in step_function outputs = model.distribute_strategy.run(run_step, args=(data,)) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1303, in run_step outputs = model.train_step(data) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1080, in train_step y_pred = self(x, training=True) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 569, in __call__ return super().__call__(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__ outputs = call_fn(inputs, *args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/functional.py", line 512, in call return self._run_internal_graph(inputs, training=training, mask=mask) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/functional.py", line 669, in _run_internal_graph outputs = node.layer(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler 
return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__ outputs = call_fn(inputs, *args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler return fn(*args, **kwargs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 321, in call return self.activation(outputs) File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/activations.py", line 321, in relu return backend.relu( File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/backend.py", line 5397, in relu x = tf.nn.relu(x) Node: 'model/conv2d/Relu' Fused conv implementation does not support grouped convolutions for now. [[{{node model/conv2d/Relu}}]] [Op:__inference_train_function_1425]
We encountered this error repeatedly. It appears to be related to the number of channels in the image being fed to the model: the input should be a single-channel (grayscale/luminance) image.
We searched for this problem and found the following discussions:
https://stackoverflow.com/questions/61796021/unimplementederror-fused-conv-implementation-does-not-support-grouped-convoluti
and
https://stackoverflow.com/questions/73130599/tensorflow-fused-conv-implementation-does-not-support-grouped-convolutions
But they didn't solve our problem.
We suspect the problem is related to converting RGB to YUV. In the main model we performed that conversion, but we were not able to apply it again to our own dataset. So the model probably needs to receive just the Y channel, while we are feeding it RGB.
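A minimal sketch of the fix we suspect is needed is shown below: convert each RGB batch to YUV and keep only the luminance channel before it reaches the network. The names used here are illustrative; this is not yet part of the notebook.

import tensorflow as tf

def rgb_to_luminance(rgb_batch):
    # Convert an RGB batch (values in [0, 1]) to the single Y (luminance) channel
    yuv = tf.image.rgb_to_yuv(rgb_batch)
    y, _, _ = tf.split(yuv, 3, axis=-1)
    return y  # shape (..., H, W, 1), matching the model's expected single-channel input

# Example: map an existing (low_res, high_res) dataset so both sides become luminance-only
# luminance_ds = paired_ds.map(lambda lr, hr: (rgb_to_luminance(lr), rgb_to_luminance(hr)))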
We will work on that to solve this problem later.
Our Streamlit app can be found at the following location: https://testrepo-y12d7zyos0a.streamlit.app/
The repo for the app is at the following address: https://github.com/dusanBirta/test_repo
The Streamlit app functions in the following way. The user is prompted to upload a photo of a person to animate. The photo is passed to our YOLOv8 model, which identifies the face in the image and crops it using the predicted bounding box. The cropped face image is then passed to the animation model, which uses the First Order Motion Model to create a 3D animation from the 2D image: a driving video containing facial motion is used to generate an animation with the same motion applied to the 2D image. The code for the app can be found here: https://github.com/dusanBirta/test_repo/blob/main/app.py
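A stripped-down sketch of that flow is shown below; detect_and_crop_face and animate_face are illustrative stand-ins for our actual YOLOv8 and First Order Motion calls in app.py.

import streamlit as st
from PIL import Image

st.title("2D-to-3D Face Animation")

uploaded = st.file_uploader("Upload a photo of a person", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Uploaded photo")

    face = detect_and_crop_face(image)              # YOLOv8 face detection + bounding-box crop
    st.image(face, caption="Detected face")

    video_path = animate_face(face, "driving.mp4")  # First Order Motion animation
    st.video(video_path)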
Several issues arose when implementing the Streamlit app.
Issue 1: We could not implement all of the models into the app.
Issue 2: CUDA is not available on Streamlit Cloud.
Issue 3: The app crashed on Streamlit Cloud after the full implementation.
In conclusion, our project successfully addresses the challenge of transforming 2D images into captivating 3D representations through an automated and user-friendly approach. By combining YOLOv8 for face detection, a CNN model for image enhancement, and the First Order Motion Model for Image Animation, we have achieved impressive results in generating realistic 3D images and clips from 2D input.
Our innovative algorithm enables a wide range of users to create compelling 3D content without the need for labor-intensive manual processes or specialized software. With the democratization of 3D content creation, our solution holds promising potential across various industries, revolutionizing the way we interact with and experience visual media.
We also discovered that Streamlit Cloud is not always the best option for deploying machine learning models, particularly when they require large resources and have greater complexity; for these sorts of applications and models it may be better to use other options such as Gradio.
This project contributes to the advancement of computer vision and deep learning applications and opens up new possibilities for immersive storytelling and visualization.